feat(elasticsearch): optional ingest_pipeline for bulk writes #3252

Open
GunaPalanivel wants to merge 12 commits into deepset-ai:main from GunaPalanivel:feat/2940-elasticsearch-ingest-pipeline

@GunaPalanivel GunaPalanivel commented Apr 28, 2026

What

Adds an optional ingest_pipeline parameter to ElasticsearchDocumentStore. When set, write_documents and write_documents_async pass it as the Elasticsearch bulk API pipeline argument so ingest pipelines (including inference processors) can run at index time. Default behavior is unchanged (None).

Why

Users who index into Elasticsearch with a pre-defined ingest pipeline had no way to attach that pipeline from Haystack. The parent discussion in #699 defers broader retrieval work; this change is the narrow index-time hook only.

How

  • New ingest_pipeline on ElasticsearchDocumentStore.__init__: non-empty strings only (after strip); whitespace-only raises ValueError.
  • helpers.bulk / helpers.async_bulk receive pipeline=... only when the value is set, so deletes and existing callers are unaffected.
  • to_dict / from_dict include the field like sparse_vector_field (missing key deserializes to None).
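The three points above can be sketched in isolation. This is a minimal illustration, not the code from this PR: normalize_ingest_pipeline, bulk_kwargs, and StoreSketch are hypothetical names, and the sketch assumes that the stripped value is what gets stored and that to_dict always emits the key.

```python
from typing import Any, Dict, Optional


def normalize_ingest_pipeline(value: Optional[str]) -> Optional[str]:
    """Return the stripped pipeline name, or None when no pipeline is configured."""
    if value is None:
        return None
    stripped = value.strip()
    if not stripped:
        # Empty or whitespace-only names are rejected instead of silently ignored.
        raise ValueError("ingest_pipeline must be a non-empty string when provided")
    return stripped


def bulk_kwargs(ingest_pipeline: Optional[str]) -> Dict[str, Any]:
    """Build the extra keyword arguments for a bulk write."""
    # `pipeline` is added only when a value is set, so other calls stay unchanged.
    return {"pipeline": ingest_pipeline} if ingest_pipeline is not None else {}


class StoreSketch:
    """Hypothetical stand-in for the document store (assumption, not the real class)."""

    def __init__(self, ingest_pipeline: Optional[str] = None) -> None:
        self.ingest_pipeline = normalize_ingest_pipeline(ingest_pipeline)

    def to_dict(self) -> Dict[str, Any]:
        # Assumption: the key is always emitted, with None when unset.
        return {"init_parameters": {"ingest_pipeline": self.ingest_pipeline}}

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "StoreSketch":
        # A missing key deserializes to None.
        params = data.get("init_parameters", {})
        return cls(ingest_pipeline=params.get("ingest_pipeline"))
```

In a write path, the result of bulk_kwargs would be unpacked into the bulk helper call (for example `**bulk_kwargs(self.ingest_pipeline)`), which keeps the unset case identical to the previous behavior.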

Testing

Unit tests cover serialization (test_to_dict, from_dict variants), validation, and mocked bulk calls to assert pipeline is passed or omitted. Retriever to_dict / from_dict expectations were updated for nested document store serialization.
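For readers who want to see the shape of the mocked bulk assertions, here is a rough sketch using only unittest.mock. The write_actions function is a hypothetical stand-in for the store's write path and the bulk callable is a plain MagicMock; neither is taken from the actual test suite.

```python
from typing import Any, Callable, Dict, Iterable, Optional
from unittest.mock import MagicMock


def write_actions(
    bulk: Callable[..., Any],
    client: Any,
    actions: Iterable[Dict[str, Any]],
    ingest_pipeline: Optional[str] = None,
) -> Any:
    """Hypothetical write path: forward `pipeline` to the bulk helper only when set."""
    extra = {"pipeline": ingest_pipeline} if ingest_pipeline is not None else {}
    return bulk(client=client, actions=actions, **extra)


def check_pipeline_passed() -> None:
    bulk = MagicMock()
    write_actions(bulk, client=object(), actions=[{"_id": "1"}], ingest_pipeline="my-pipeline")
    # The mock records keyword arguments, so the forwarded value can be asserted directly.
    assert bulk.call_args.kwargs["pipeline"] == "my-pipeline"


def check_pipeline_omitted() -> None:
    bulk = MagicMock()
    write_actions(bulk, client=object(), actions=[{"_id": "1"}])
    # With no pipeline configured, the keyword must be absent, not passed as None.
    assert "pipeline" not in bulk.call_args.kwargs


check_pipeline_passed()
check_pipeline_omitted()
```

Asserting absence of the keyword, not just a None value, is what guards the claim that existing callers are unaffected.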

To reproduce locally:

cd integrations/elasticsearch
hatch run test:unit
hatch run test:types

For formatting on Windows, if hatch run fmt fails on path handling, use:

hatch -e default run ruff check --fix <paths>
hatch -e default run ruff format <paths>

Test output:

117 passed, 182 deselected (unit tests on current main)

Type check:

Success: no issues found in 10 source files

Trade-offs

  • No per-call pipeline override on write_documents (only store-level). Can be added later if needed.
  • Pipeline creation and mapping design stay in Elasticsearch; Haystack does not validate that the pipeline exists.

Pass Elasticsearch bulk pipeline when set so ingest pipelines (e.g. inference) run at index time. Serialize in to_dict; omit when unset. Extend retriever serialization tests.

Fixes deepset-ai#2940

Made-with: Cursor
@GunaPalanivel GunaPalanivel requested a review from a team as a code owner April 28, 2026 16:54
@GunaPalanivel GunaPalanivel requested review from anakin87 and removed request for a team April 28, 2026 16:54
@github-actions github-actions Bot added the type:documentation and integration:elasticsearch labels and removed the type:documentation label Apr 28, 2026

github-actions Bot commented Apr 28, 2026

Coverage report (elasticsearch)

File and lines missing:
  integrations/elasticsearch/src/haystack_integrations/components/retrievers/elasticsearch
    bm25_retriever.py
    embedding_retriever.py
  integrations/elasticsearch/src/haystack_integrations/document_stores/elasticsearch
    document_store.py: lines missing 373-383, 406-416, 480
    filters.py

This report was generated by python-coverage-comment-action

Add filter edge-case tests (comparisons, ranges, in/not in) and retriever init validation tests. filters.py reaches 100% line coverage; combined package unit coverage ~63%.

Made-with: Cursor
@github-actions github-actions Bot added the type:documentation label Apr 28, 2026
@anakin87 anakin87 requested a review from davidsbatista April 29, 2026 09:51
@anakin87 anakin87 removed their request for review April 30, 2026 10:53
@davidsbatista
Contributor

@GunaPalanivel, you did not add any integration tests against the Elasticsearch cloud instance.

I will do it and can take it over from here.

@GunaPalanivel
Contributor Author

> @GunaPalanivel, you did not add any integration tests against the Elasticsearch cloud instance.
>
> I will do it and can take it over from here.

@davidsbatista - if that's fine with you, I'll add integration tests against the Elasticsearch cloud instance.

@davidsbatista
Contributor

>> @GunaPalanivel, you did not add any integration tests against the Elasticsearch cloud instance.
>> I will do it and can take it over from here.
>
> @davidsbatista - if that's fine with you, I'll add integration tests against the Elasticsearch cloud instance.

No need, thanks, your contribution was already helpful!

Labels

integration:elasticsearch, type:documentation

Development

Successfully merging this pull request may close these issues.

Index-time inference in ElasticsearchDocumentStore (optional / deferred)

2 participants